Titanic Survival Analysis

The following analysis attempts to answer the question: which passenger characteristics influenced the likelihood of surviving the sinking of the Titanic?

The dataset was provided by Udacity.

In [94]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
In [95]:
#Titanic dataset provided by Udacity 
FileName = 'titanic_data.csv'
fileNameTitanic = '/Users/Preston/Programing/Data Science/Udacity Nano Degree/project 2/' + FileName
titanicStats = pd.read_csv(fileNameTitanic)

Data Set

The original data set (a subset of which is shown below) includes information on 891 passengers. The factors that I will focus on are gender, passenger class, and age.

In [96]:
titanicStats.tail()
Out[96]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
886 887 0 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.00 NaN S
887 888 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 112053 30.00 B42 S
888 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female NaN 1 2 W./C. 6607 23.45 NaN S
889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.00 C148 C
890 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.75 NaN Q

Missing Values

The following code shows that gender and passenger class were recorded for every passenger in the data set, but age is missing for 177 passengers.

In [97]:
# Find all of the null fields in the data set. 
titanicStats.isnull().sum()
Out[97]:
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

Formulas for analysis

I have created several functions to simplify my analysis. The first takes the original data set and groups it by one of the selected factors (column headings). The other two calculate the chi squared statistic and Cramer's V.

In [98]:
def new_dataframe(old_frame, grouping_column):
    '''Function to create a dataframe for the relevant subcategory.
    @param old_frame the primary dataframe.
    @param grouping_column the column name (or list of column names) to group by.
    @return a dataframe with columns Survived, Perished, Total and Survival_Rate.
    '''
    # Group the frame by the selected column and pull only the survival numbers.
    newFrame = old_frame.groupby(grouping_column)['Survived'].agg(['sum', 'count'])
    # Rename the columns: the sum of the 0/1 flags is the number of survivors.
    newFrame.columns = ['Survived', 'Total']
    # Add columns for the number perished and the survival rate.
    newFrame['Perished'] = newFrame['Total'] - newFrame['Survived']
    newFrame['Survival_Rate'] = newFrame['Survived'] / newFrame['Total']
    # Reorder the columns.
    newFrame = newFrame[['Survived', 'Perished', 'Total', 'Survival_Rate']]

    return newFrame
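
To illustrate what new_dataframe returns, here is a minimal sketch on a made-up five-row frame. The toy data is invented for illustration and is not taken from the Titanic data set; the function body is repeated only so that the example runs on its own.

```python
import pandas as pd

def new_dataframe(old_frame, grouping_column):
    # Same logic as the function above, repeated to keep the example self-contained.
    new_frame = old_frame.groupby(grouping_column)['Survived'].agg(['sum', 'count'])
    new_frame.columns = ['Survived', 'Total']
    new_frame['Perished'] = new_frame['Total'] - new_frame['Survived']
    new_frame['Survival_Rate'] = new_frame['Survived'] / new_frame['Total']
    return new_frame[['Survived', 'Perished', 'Total', 'Survival_Rate']]

# Invented toy data: two women who survived, three men of whom one survived.
toy = pd.DataFrame({
    'Sex': ['female', 'female', 'male', 'male', 'male'],
    'Survived': [1, 1, 0, 1, 0],
})

toy_stats = new_dataframe(toy, 'Sex')
print(toy_stats)
```

Each row of the result is one group, so the same function can be reused for gender, passenger class, or any other categorical column.
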
In [99]:
def chi_squared_survival(observed_dict):
    '''Chi squared function.

    @param observed_dict dataframe of observed values, with the number of survivors in the
    first column and the number perished in the second column. The index should hold
    the groups being compared (for example, each gender).
    @return chi squared value of the survival counts in the dataframe.
    Note: expected counts use the global total_survival_odds, which must be defined
    before this function is called.
    '''
    # the running total for the chi squared statistic
    chiSquared = 0

    for i in range(len(observed_dict)):
        obs_survived = observed_dict.iloc[i, 0]
        obs_perished = observed_dict.iloc[i, 1]
        totalPeople = obs_survived + obs_perished
        # the numbers expected to survive and perish at the overall survival rate
        expect_survived = totalPeople * total_survival_odds
        expect_perished = totalPeople * (1 - total_survival_odds)

        chiSquared += ((obs_survived - expect_survived)**2) / expect_survived
        chiSquared += ((obs_perished - expect_perished)**2) / expect_perished

    return chiSquared
In [100]:
import math

def cramerV(chiScore, totalObserved, k):
    '''Cramer's V function.

    @param chiScore the chi squared score.
    @param totalObserved the total number of samples observed.
    @param k the smaller of the number of rows and columns in the contingency table.
    @return Cramer's V score.
    '''
    return math.sqrt(chiScore / (totalObserved * (k - 1)))
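
For reference, the two functions above can be written as the following formulas. The symbols are introduced here only for illustration and do not appear in the code: O and E are the observed and expected counts for group i (subscript s for survived, p for perished), n_i is the number of passengers in group i, \hat{p} is the overall survival rate (total_survival_odds), N is the total number of passengers observed, and k is the third argument of cramerV.

```latex
\chi^2 = \sum_{i} \left[ \frac{(O_{i,s} - E_{i,s})^2}{E_{i,s}} + \frac{(O_{i,p} - E_{i,p})^2}{E_{i,p}} \right],
\qquad E_{i,s} = n_i \hat{p}, \quad E_{i,p} = n_i (1 - \hat{p})

V = \sqrt{\frac{\chi^2}{N (k - 1)}}
```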

Total Survival Rate

The overall survival rate for the 891 passengers in the data set was 38.38%. This overall rate will be compared with the survival rates of several subgroups of passengers.

In [101]:
total_survival_odds = titanicStats.Survived.sum() / float(titanicStats.Survived.count())
total_survival_odds
Out[101]:
0.3838383838383838

Survival By Gender

The following code evaluates the survival rates for each gender.

Based on the 55.3 percentage point difference in survival rates between women (74.2%) and men (18.9%), it appears that gender was a large factor in determining the likelihood of survival.

In [102]:
# Create a dataframe for the gender survival statistics. 
genderStats = new_dataframe(titanicStats, 'Sex')
genderStats
Out[102]:
Survived Perished Total Survival_Rate
Sex
female 233 81 314 0.742038
male 109 468 577 0.188908
In [103]:
# Difference in survival rates.
genderStats.Survival_Rate['female'] - genderStats.Survival_Rate['male']
Out[103]:
0.55313007097992029
In [104]:
# Survival by gender bar chart.
genderStats.Survival_Rate.plot(kind='bar', title='Survival Rate')
Out[104]:
<matplotlib.axes._subplots.AxesSubplot at 0x11dfb5090>

Chi Squared by Gender

To determine the statistical significance of the gender based survival rate, I utilized a Chi Squared test.

Null hypothesis: gender was not a significant factor in determining survival rates
Alternative hypothesis: gender was a statistically significant factor in determining survival rates
Alpha level of .05.

Based on the chi squared score of 263.05 and one degree of freedom, the p value is less than 0.0001 (per graphpad.com). As such, the null hypothesis is rejected, and gender is considered a statistically significant factor in determining survival rate.

Additionally, based on the Cramer's V value of .54 (calculated below), gender can be considered to have a strong effect on survival rates.

In [105]:
# Calculate the Chi Squared score for survival rates based on gender.  
chi_squared_gender = chi_squared_survival(genderStats)
chi_squared_gender
Out[105]:
263.05057407065567
In [106]:
# Calculate Cramer's V for the survival rates based on gender. 
cramerV(chi_squared_gender, genderStats['Total'].sum(), 2)
Out[106]:
0.5433513806577551
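
The p value looked up on graphpad.com can also be checked directly in Python. The sketch below is an illustration rather than part of the original analysis: the helper name chi2_p_value is hypothetical, and it relies on the closed-form tail probabilities of the chi squared distribution, which are this simple only for one and two degrees of freedom.

```python
import math

def chi2_p_value(chi_score, dof):
    """Upper-tail probability of a chi squared statistic (hypothetical helper).

    Only dof = 1 and dof = 2 are supported, because those cases have
    closed forms that need nothing beyond the standard library.
    """
    if dof == 1:
        # A chi squared variable with 1 dof is a squared standard normal.
        return math.erfc(math.sqrt(chi_score / 2.0))
    if dof == 2:
        # With 2 dof the distribution is exponential with mean 2.
        return math.exp(-chi_score / 2.0)
    raise ValueError("only 1 or 2 degrees of freedom are supported here")

# Gender chi squared score reported in Out[105], one degree of freedom.
p_gender = chi2_p_value(263.05057407065567, 1)
print(p_gender < 0.0001)
```

For tables with more degrees of freedom, a statistics library would be the more practical choice.
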

Survival Rates By Passenger Class

The following analysis compares the survival rate for each of the three ticket classes on board.

In [107]:
# Analyse the survival rates by passenger class.
classStats = new_dataframe(titanicStats, 'Pclass')
classStats
Out[107]:
Survived Perished Total Survival_Rate
Pclass
1 136 80 216 0.629630
2 87 97 184 0.472826
3 119 372 491 0.242363

As the following two charts illustrate, there were more third class passengers than first and second class passengers combined. However, the survival rate declined from first class to second class, and then further to third class.

In [108]:
# plot of the total number of passengers by passenger class. 
classStats.Total.plot(kind='pie', title='Passenger Count by Class')
Out[108]:
<matplotlib.axes._subplots.AxesSubplot at 0x11e22e890>
In [109]:
# Plot of survival odds by passenger class. 
classStats.Survival_Rate.plot(kind='bar', title='Survival Rate')
Out[109]:
<matplotlib.axes._subplots.AxesSubplot at 0x11dfa6cd0>

Chi Squared for Passenger Class

To determine the statistical significance of the survival rates of each passenger class, I utilized a Chi Squared test.

Null hypothesis: passenger class was not a significant factor in determining survival rates
Alternative hypothesis: passenger class was a significant factor in determining survival rates
Alpha level of .05.

Based on the chi squared score of 102.89 and two degrees of freedom, the p value is less than 0.0001 (per graphpad.com). As such, the null hypothesis is rejected and passenger class is considered a statistically significant factor in determining survival.

Based on the Cramer's V value of .34 (calculated below), passenger class can be considered to have a medium effect on survival rates.

In [110]:
# Calculate Chi Squared for the survival rates based on passenger class. 
chi_by_class = chi_squared_survival(classStats)
chi_by_class
Out[110]:
102.88898875696057
In [111]:
# Calculate Cramer's V for survival rates by class.
cramerV(chi_by_class, classStats.Total.sum(), 2)
Out[111]:
0.33981738800531175
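
The third argument passed to cramerV is 2 even though there are three passenger classes. That is consistent with the usual definition, in which k is the smaller of the number of rows and columns of the contingency table (here 3 classes by 2 outcomes). A small sketch that makes this explicit follows; the helper name and its arguments are hypothetical.

```python
import math

def cramers_v_from_shape(chi_score, total_observed, n_rows, n_cols):
    """Cramer's V with k derived from the table shape (hypothetical helper)."""
    k = min(n_rows, n_cols)
    return math.sqrt(chi_score / (total_observed * (k - 1)))

# Passenger class table: 3 classes (rows) by 2 outcomes (survived, perished).
chi_by_class = 102.88898875696057  # value reported in Out[110]
v_class = cramers_v_from_shape(chi_by_class, 891, 3, 2)

# Degrees of freedom of a 3 x 2 table: (rows - 1) * (columns - 1).
dof_class = (3 - 1) * (2 - 1)

print(round(v_class, 4), dof_class)
```
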

Survival Rate by Age

The following analyzes the survival rates by age group of the passengers.

For the 714 passengers with available age data, the mean age is 29.7 years, with a median of 28, a minimum of 0.42 (about 5 months) and a maximum of 80.

In [112]:
# General description of the age data and the median value. 
titanicStats.Age.describe(), titanicStats.Age.median()
Out[112]:
(count    714.000000
 mean      29.699118
 std       14.526497
 min        0.420000
 25%       20.125000
 50%       28.000000
 75%       38.000000
 max       80.000000
 Name: Age, dtype: float64, 28.0)

Age by Decade

To analyze the survival rates based on age, I separated the passengers into categories by decade of age.

The function below returns the decade of a passenger's age. It is used to add the respective decade category to the data frame. I then created a new data frame slice with the passengers that have null age values removed.

In [113]:
def ageDecade(age):
    '''Function to determine the age of the passengers by decade.
    @param age age of the passenger in years.
    @return decade of the passenger's age as a string.
    '''
    # Use strict upper bounds throughout so that fractional ages such as 9.5 stay in '0-9'.
    if age < 10:
        return '0-9'
    elif age < 20:
        return '10-19'
    elif age < 30:
        return '20-29'
    elif age < 40:
        return '30-39'
    elif age < 50:
        return '40-49'
    elif age < 60:
        return '50-59'
    elif age < 70:
        return '60-69'
    else:
        return '70+'
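
One detail worth noting about ageDecade: every comparison against a missing age (NaN) is False, so a passenger with no recorded age falls through to the final else branch and is labelled '70+'. This is harmless here because rows with a null age are dropped before grouping, but a variant that handles missing ages explicitly could look like the sketch below. The function name age_decade_safe is hypothetical and not part of the original notebook.

```python
import math

def age_decade_safe(age):
    """Return the decade label for an age, or None when the age is missing."""
    if age is None or math.isnan(age):
        return None
    if age >= 70:
        return '70+'
    # Floor to the start of the decade, e.g. 14.5 -> 10.
    lower = int(age // 10) * 10
    return '{0}-{1}'.format(lower, lower + 9)

print(age_decade_safe(0.42), age_decade_safe(14.5),
      age_decade_safe(80), age_decade_safe(float('nan')))
```
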
In [114]:
# Add the ageDecade column to the dataframe.
titanicStats['ageDecade'] = titanicStats['Age'].map(ageDecade)

# Create a slice of the dataframe and remove anything with a null value in the age column. 
ageDf = titanicStats.dropna(subset=['Age'])
ageDf.tail()
Out[114]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked ageDecade
885 886 0 3 Rice, Mrs. William (Margaret Norton) female 39.0 0 5 382652 29.125 NaN Q 30-39
886 887 0 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.000 NaN S 20-29
887 888 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 112053 30.000 B42 S 10-19
889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.000 C148 C 20-29
890 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.750 NaN Q 30-39
In [115]:
# Check the length of the new data frame to make sure that the 177 rows with null age fields were removed. 
print('Original table length = ', len(titanicStats))
print('New table length = ', len(ageDf))
print("Difference = ", len(titanicStats) - len(ageDf))
Original table length =  891
New table length =  714
Difference =  177

Survival Rates by Age in Decades

Most of the age groups have survival rates close to the overall survival rate (38.4%). The exceptions are the 0-9 age group and the 70+ age group, which are respectively higher and lower than the overall rate. However, there are too few passengers in the 70+ group to draw any conclusions (only 7 passengers aged 70 or over). What remains is the appearance of a 'privilege of youth' as it relates to survival.

In [116]:
#Create the new dataframe grouped by the age by decades of the passengers.
statsByDecade = new_dataframe(ageDf, 'ageDecade')
statsByDecade
Out[116]:
Survived Perished Total Survival_Rate
ageDecade
0-9 38 24 62 0.612903
10-19 41 61 102 0.401961
20-29 77 143 220 0.350000
30-39 73 94 167 0.437126
40-49 34 55 89 0.382022
50-59 20 28 48 0.416667
60-69 6 13 19 0.315789
70+ 1 6 7 0.142857
In [117]:
# Create a bar chart of the survival rates grouped by decade of age. 
statsByDecade.Survival_Rate.plot(kind='bar', title='Survival Rate')
Out[117]:
<matplotlib.axes._subplots.AxesSubplot at 0x11e432490>

Expiration Date of the Privilege of Youth

To determine whether the seeming privilege of youth was significant, I first sought to find the age at which this privilege expired. To do this, I created a new data frame with only the passengers who were under 20 years old and compared the survival rates in the table below to the overall survival rate (38.4%).

The privilege of youth appears to expire after the age of 15: the survival rate drops from 80% at age 15 to 35% at age 16 and stays relatively close to the 38% overall average from that point on.

In [118]:
youthDf = new_dataframe(ageDf[(ageDf.Age <= 19)], 'Age')
youthDf.tail(10)
Out[118]:
Survived Perished Total Survival_Rate
Age
11.0 1 3 4 0.250000
12.0 1 0 1 1.000000
13.0 2 0 2 1.000000
14.0 3 3 6 0.500000
14.5 0 1 1 0.000000
15.0 4 1 5 0.800000
16.0 6 11 17 0.352941
17.0 6 7 13 0.461538
18.0 9 17 26 0.346154
19.0 9 16 25 0.360000
In [119]:
# Create a bar graph of the survival rates based on age for passengers under 20. 
youthDf.Survival_Rate.plot(kind='bar', title='Youth Survival Rates')
Out[119]:
<matplotlib.axes._subplots.AxesSubplot at 0x11e6334d0>

I then broke the passengers into two groups: those 15 and younger and those 16 and older.

In [120]:
'''Function to test if age is below 16.
    @param age
    @return boolean.
'''
def under_16(age):
    return age < 16
In [121]:
# Create a new field in the data frame 'Under_16'.
titanicStats['Under_16'] = titanicStats['Age'].map(under_16)

# Remove any passengers with null age data.  
youthDf = titanicStats.dropna(subset=['Age'])

# Create a new data frame based on the Under_16 categories.  
youth_vs_Df = new_dataframe(youthDf, 'Under_16')
youth_vs_Df
Out[121]:
Survived Perished Total Survival_Rate
Under_16
False 241 390 631 0.381933
True 49 34 83 0.590361

Chi Squared for Passenger Age Group

To determine the statistical significance of the difference in survival rates by passenger age (under 16 compared to 16 and over), I utilized a Chi Squared test.

Null hypothesis: passenger age was not a significant factor in determining survival rates
Alternative hypothesis: passenger age was a statistically significant factor in determining survival rates
Alpha level of .05.

Based on the chi squared score of 14.98 and one degree of freedom, the p value is about 0.0001 (looked up on graphpad.com), which is well below the alpha level of .05. As such, the null hypothesis is rejected and passenger age is considered a statistically significant factor in determining survival.

Based on the Cramer's V value of .14 (calculated below), passenger age can be considered as having a small effect on survival rates.

In [122]:
# Calculate the Chi Squared value for age group survival rate, under 16 vs. 16 and up. 
chi_youth_vs = chi_squared_survival(youth_vs_Df)
chi_youth_vs
Out[122]:
14.977970720971808
In [123]:
# Calculate Cramer's V for survival by age group.
cramerV(chi_youth_vs, youth_vs_Df.Total.sum(), 2)
Out[123]:
0.1448362869911138
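
One caveat about the age calculation: chi_squared_survival takes its expected counts from total_survival_odds, the survival rate of all 891 passengers, while the age table covers only the 714 passengers with a recorded age. For gender and class the two are the same thing, but for age a standard test of independence would take the expected counts from the row and column totals of the age table itself. The sketch below shows that variant; the function name chi_squared_independence is hypothetical, and the counts are copied by hand from Out[121]. It gives a somewhat smaller statistic (about 13.2 rather than 14.98), and the conclusion at the .05 level is unchanged.

```python
import math

def chi_squared_independence(table):
    """Pearson chi squared for a list of (survived, perished) rows.

    Expected counts come from the row and column totals of this table,
    not from a survival rate computed elsewhere.
    """
    total_survived = sum(row[0] for row in table)
    total_perished = sum(row[1] for row in table)
    grand_total = float(total_survived + total_perished)
    chi_score = 0.0
    for survived, perished in table:
        row_total = survived + perished
        expect_survived = row_total * total_survived / grand_total
        expect_perished = row_total * total_perished / grand_total
        chi_score += (survived - expect_survived) ** 2 / expect_survived
        chi_score += (perished - expect_perished) ** 2 / expect_perished
    return chi_score

# Counts from Out[121]: 16 and over first, then under 16.
age_table = [(241, 390), (49, 34)]
chi_age = chi_squared_independence(age_table)

# One degree of freedom, so the upper-tail probability has a closed form.
p_age = math.erfc(math.sqrt(chi_age / 2.0))
print(round(chi_age, 2), p_age < 0.05)
```
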

Combined Variables

Combining the passenger variables reveals interesting interactions between the variables.

Privilege of Youth vs Class vs Gender

Under 16

While it does appear that passengers under 16 in the first and second classes were favored, with only 1 out of 25 perishing, 25 passengers in those categories are too few to draw conclusions from.

For those under 16 in third class, youth was a notable advantage for males, with a 32.1% survival rate for those under 16 compared to 12.9% for those 16 and over. The advantage was smaller for females, with a 53.3% survival rate for third class females under 16 compared to 43.1% for those 16 and over.

16 and Over

For men aged 16 and over, the only way to have had a reasonable chance at survival was to have been in first class. Men in first class had a 37.8% survival rate compared to 6.7% and 12.9% for men aged 16 and over in second and third class, respectively.

For women aged 16 and over, class was a major factor in survival. Although women aged 16 and over in third class had a survival rate of 43.1%, which was well above the men's overall survival rate (18.9%), it was still less than half of the survival rate for women aged 16 and over in first and second class (97.6% and 90.6%, respectively).

In [124]:
'''Remove the null age data and create a data frame table with the three factors combined;
    age under/over 16 years, gender, and passenger class.'''

ageOnlyDf = titanicStats.dropna(subset=['Age'])
combined_factors = new_dataframe(ageOnlyDf, ['Under_16', 'Sex', 'Pclass'])

combined_factors
Out[124]:
Under_16  Sex     Pclass  Survived  Perished  Total  Survival_Rate
False     female  1             80         2     82       0.975610
False     female  2             58         6     64       0.906250
False     female  3             31        41     72       0.430556
False     male    1             37        61     98       0.377551
False     male    2              6        84     90       0.066667
False     male    3             29       196    225       0.128889
True      female  1              2         1      3       0.666667
True      female  2             10         0     10       1.000000
True      female  3             16        14     30       0.533333
True      male    1              3         0      3       1.000000
True      male    2              9         0      9       1.000000
True      male    3              9        19     28       0.321429
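
The figures quoted in the discussion above can be reproduced from the table counts. A minimal sketch, with the counts copied by hand from Out[124] and variable names chosen only for illustration:

```python
# (survived, perished) counts for passengers under 16 in first and second class,
# copied from Out[124].
young_upper_classes = {
    ('female', 1): (2, 1),
    ('female', 2): (10, 0),
    ('male', 1): (3, 0),
    ('male', 2): (9, 0),
}

total_young = sum(s + p for s, p in young_upper_classes.values())
perished_young = sum(p for s, p in young_upper_classes.values())
print(perished_young, 'of', total_young, 'perished')

# Third class survival rates as (under 16, 16 and over) pairs.
male_third = (9 / 28.0, 29 / 225.0)
female_third = (16 / 30.0, 31 / 72.0)
third_class_pct = [round(100 * rate, 1) for rate in male_third + female_third]
print(third_class_pct)
```
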

Survival Rates:

For comparison with the above table, the previously calculated survival rates are displayed here:

Overall = 38.4%
Male = 18.9%
Female = 74.2%
1st class = 63.0%
2nd class = 47.3%
3rd class = 24.3%

Conclusion

Based on the analysis above, gender, age and passenger class all influenced the survival rate for passengers on the Titanic.

To have the best chance of survival, you would have wanted to be female and in first (or second) class. If you were male, you did not want to be in second or third class, especially if you were over 15 years old.